Partially-Supervised Image Captioning

Neural Information Processing Systems

Image captioning models are becoming increasingly successful at describing the content of images in restricted domains. However, if these models are to function in the wild --- for example, as assistants for people with impaired vision --- a much larger number and variety of visual concepts must be understood. To address this problem, we teach image captioning models new visual concepts from labeled images and object detection datasets. Since image labels and object classes can be interpreted as partial captions, we formulate this problem as learning from partially-specified sequence data. We then propose a novel algorithm for training sequence models, such as recurrent neural networks, on partially-specified sequences which we represent using finite state automata. In the context of image captioning, our method lifts the restriction that previously required image captioning models to be trained on paired image-sentence corpora only, or otherwise required specialized model architectures to take advantage of alternative data modalities. Applying our approach to an existing neural captioning model, we achieve state-of-the-art results on the novel object captioning task using the COCO dataset. We further show that we can train a captioning model to describe new visual concepts from the Open Images dataset while maintaining competitive COCO evaluation scores.
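The core idea above is that a partial caption (e.g., an image label) can be represented as a finite state automaton over word sequences. A minimal sketch, not the authors' implementation: a single label such as `"zebra"` corresponds to a two-state FSA that accepts any caption containing that word at least once.

```python
# Hedged sketch: encode an image label as a two-state FSA.
# State 0 = label not yet seen; state 1 = label seen (accepting).
# The label word "zebra" is a hypothetical example, not from the paper.

def make_label_fsa(required_word):
    """Return (transition function, start state, accepting states)."""
    def step(state, word):
        return 1 if (state == 1 or word == required_word) else 0
    return step, 0, {1}

def accepts(caption, fsa):
    """Run the FSA over the caption's words and test acceptance."""
    step, state, accepting = fsa
    for word in caption.split():
        state = step(state, word)
    return state in accepting

fsa = make_label_fsa("zebra")
print(accepts("a zebra standing in a field", fsa))  # True
print(accepts("a horse standing in a field", fsa))  # False
```

In the paper's setting, such automata constrain which sequences a caption decoder may produce during training on partially-specified data; the sketch only shows the acceptance check itself.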






Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering

Medhini Narasimhan, Svetlana Lazebnik, Alexander Schwing

Neural Information Processing Systems

Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction, a novel 'fact-based' visual question answering (FVQA) task has been introduced recently, along with a large set of curated facts which link two entities, i.e., two possible answers, via a relation.
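The curated facts described above are relational triples linking two entities. A minimal sketch, with hypothetical facts and relation names (not taken from the FVQA dataset), of how such a fact base can be queried by relation:

```python
# Hedged illustration: each fact links two entities (possible answers)
# via a relation. The triples below are invented examples.
facts = [
    ("Cat", "CapableOf", "ClimbingTrees"),
    ("Umbrella", "UsedFor", "Rain"),
    ("Dog", "IsA", "Pet"),
]

def entities_linked_by(relation):
    """Return all (entity1, entity2) pairs connected by the given relation."""
    return [(e1, e2) for e1, rel, e2 in facts if rel == relation]

print(entities_linked_by("UsedFor"))  # [('Umbrella', 'Rain')]
```

In the paper's setting, answering a question reduces to retrieving the relevant fact and selecting the correct entity from it; the sketch shows only the retrieval-by-relation step.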





Supplementary Material for Self-Supervised Visual Representation Learning with Semantic Grouping

Xin Wen

Neural Information Processing Systems

There are two operations in our data augmentation pipeline that change the scale or layout of the image, i.e., random resized crop and random horizontal flip. This is followed by a resize operation that recovers the intersecting part at the original size (e.g., RoIAlign to recover the original spatial layout). The total stride is 16 (FCN-16s [20]). Intuitively, each prototype can be viewed as the cluster center of a semantic class. During inference, we only take the teacher model parameterized by ξ.
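The prototype-as-cluster-center intuition above can be sketched as nearest-prototype assignment: a feature vector is grouped under the semantic class whose prototype it matches best. This is a toy illustration with made-up 2-D features, not the paper's code:

```python
# Hedged sketch: assign a feature vector to the prototype (semantic class
# cluster center) with the highest dot-product similarity.

def assign_to_prototype(feature, prototypes):
    """Return the index of the best-matching prototype for this feature."""
    scores = [
        sum(f * p for f, p in zip(feature, proto))  # dot product
        for proto in prototypes
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Two hypothetical class centers in a 2-D feature space.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
print(assign_to_prototype([0.9, 0.2], prototypes))  # 0 (closest to first class)
```

In the actual method the features come from the dense feature map (total stride 16) and the prototypes are learned; the sketch only shows the assignment rule.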